Exploratory analysis of SUDMEX CONN

Author

Edgar

Published

October 8, 2022

Prepare variables

We define the variables needed to fetch the data from the Zenodo record page, create the directory where the files will be stored, and download them directly.

Code
from funciones import get_files_url, download_xl_file

sitio = 'https://zenodo.org/'
page = 'record/5123330'
url_list = get_files_url(f'{sitio}{page}')
download_xl_file('descargas', url_list, sitio)
2022-10-16 21:24:34,809 - Process Data - INFO - Connectome_ASI_050219.xlsx downloaded
2022-10-16 21:24:37,577 - Process Data - INFO - Connectome_BIS_050219.xlsx downloaded
2022-10-16 21:24:40,005 - Process Data - INFO - Connectome_CCQN_050219.xlsx downloaded
2022-10-16 21:24:42,658 - Process Data - INFO - Connectome_CGI_050219.xlsx downloaded
2022-10-16 21:24:45,915 - Process Data - INFO - Connectome_demographics_050219.xlsx downloaded
2022-10-16 21:24:48,815 - Process Data - INFO - Connectome_DERS_050219.xlsx downloaded
2022-10-16 21:24:52,798 - Process Data - INFO - Connectome_DES_050219.xlsx downloaded
2022-10-16 21:24:55,103 - Process Data - INFO - Connectome_DIB_050219.xlsx downloaded
2022-10-16 21:24:57,451 - Process Data - INFO - Connectome_DSS4_050219.xlsx downloaded
2022-10-16 21:25:00,794 - Process Data - INFO - Connectome_instantview_050219.xlsx downloaded
2022-10-16 21:25:09,597 - Process Data - INFO - Connectome_MINI_050219.xlsx downloaded
2022-10-16 21:25:13,497 - Process Data - INFO - Connectome_SCID_050219.xlsx downloaded
2022-10-16 21:25:16,491 - Process Data - INFO - Connectome_SCL90__050219.xlsx downloaded
2022-10-16 21:25:19,190 - Process Data - INFO - Connectome_WHODAS_050219.xlsx downloaded
2022-10-16 21:25:21,636 - Process Data - INFO - participants.xlsx downloaded
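
The `funciones` module is not shown in this notebook. As a rough sketch only, helpers in the spirit of `get_files_url` and `download_xl_file` could be built with the standard library along these lines. The names `XlsxLinkParser`, `extract_xlsx_links`, and `download_files` are hypothetical, and so is the assumption that the record page exposes its files as plain `<a href="...xlsx">` links; none of this is taken from the original code.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class XlsxLinkParser(HTMLParser):
    """Collect the href of every <a> tag that points to an .xlsx file."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Compare only the path so query strings such as ?download=1 are ignored.
        if href and urlparse(href).path.lower().endswith(".xlsx"):
            if href not in self.links:
                self.links.append(href)


def extract_xlsx_links(html, base_url):
    """Return absolute URLs for the .xlsx links found in an HTML string."""
    parser = XlsxLinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def download_files(target_dir, urls):
    """Download each URL into target_dir, creating the directory if needed."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for url in urls:
        name = Path(urlparse(url).path).name
        with urlopen(url) as response:
            (target / name).write_bytes(response.read())
```

Keeping the link extraction separate from the download step makes the parsing part easy to check against a saved copy of the page without touching the network.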

Import Data

We can read the data straight from the Excel file and apply some preprocessing. Note that some Excel files need a specific engine to be read correctly: if `read_excel` cannot read a file with its default settings, it is worth trying the openpyxl engine. That said, the general recommendation is to avoid closed formats and use CSV, zipped CSV, Parquet, or HDF5 instead.
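
To illustrate the recommendation about open formats, here is a minimal sketch that writes a dataframe to a zipped CSV once and reads it back. The toy dataframe and the file name `demographics.csv.zip` are made up for the example. Passing `engine="openpyxl"` to `pd.read_excel`, as mentioned in the comment, selects that engine explicitly for `.xlsx` files, provided the package is installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy data standing in for a sheet read with, for example:
#   pd.read_excel(path, sheet_name="Demographics", engine="openpyxl")
frame = pd.DataFrame({"rid": [1, 2, 3], "age": [41.0, 23.0, None]})

with tempfile.TemporaryDirectory() as tmp:
    csv_zip = Path(tmp) / "demographics.csv.zip"
    # Write a zip-compressed CSV and read it back with the same setting.
    frame.to_csv(csv_zip, index=False, compression="zip")
    restored = pd.read_csv(csv_zip, compression="zip")

print(restored.dtypes)
```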

Code
import pathlib
import pandas as pd
import numpy as np


parent_path = pathlib.Path().parent.resolve().parent
demographics_file = parent_path.joinpath('descargas', 'Connectome_demographics_050219.xlsx')
demo_data = pd.read_excel(demographics_file, sheet_name="Demographics")
dictionary_data = pd.read_excel(demographics_file, sheet_name="Connectome_demographics")
demo_data['income'] = pd.to_numeric(demo_data.income, errors='coerce')
demo_data['prof_mental'] = demo_data.prof_mental.astype('category')
demo_data['support_years'] = pd.to_numeric(demo_data.support_years, errors='coerce')
demo_data['work_threeyears'] = pd.to_numeric([ str(value).replace(',', '.') for value in demo_data.work_threeyears], errors='coerce')
demo_data.head()
rid group demo sex age educ occup income civil_st child ... work_thirtydays work_threeyears energy_freq energy_recent energy_cans laterality ed_score amai amai_score notes
0 1 1 1 1.0 41.0 7.0 7.0 20000.0 1.0 0.0 ... 1.0 1.0 0.0 0.0 0.0 1.0 87.5 7.0 222.0 NaN
1 2 2 1 1.0 23.0 5.0 3.0 7100.0 6.0 0.0 ... 1.0 4.0 0.0 0.0 0.0 1.0 87.5 3.0 104.0 NaN
2 3 2 1 1.0 27.0 5.0 4.0 12000.0 6.0 0.0 ... 4.0 1.0 0.0 0.0 0.0 2.0 -87.5 7.0 219.0 NaN
3 4 1 1 1.0 27.0 5.0 7.0 10000.0 6.0 0.0 ... 2.0 2.0 0.0 0.0 0.0 1.0 100.0 4.0 110.0 NaN
4 5 1 1 1.0 23.0 5.0 7.0 6000.0 6.0 0.0 ... 2.0 4.0 0.0 0.0 0.0 1.0 100.0 3.0 94.0 NaN

5 rows × 28 columns
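
The coercion steps used above can be checked on a small made-up series. This sketch shows how `errors="coerce"` turns unparseable entries into `NaN`, together with a vectorized alternative to the list comprehension that handles decimal commas. The sample values are invented for illustration.

```python
import pandas as pd

raw = pd.Series(["1,5", "2", "n/a", 3])

# Cast to string, swap the decimal comma for a point, then coerce.
cleaned = pd.to_numeric(
    raw.astype(str).str.replace(",", ".", regex=False),
    errors="coerce",
)
print(cleaned.tolist())
```

Entries that cannot be parsed, such as `"n/a"`, become missing values instead of raising an error, which is why the missing counts reported later can be higher than the number of empty cells in the spreadsheet.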

EDA tools

Skim

The tool is available in both R and Python (here, the `skimpy` package). It is simple to use and integrates seamlessly into a Jupyter notebook.

Code
from skimpy import skim

skim(demo_data)
/home/nekrum/proyectos/datavix_lanirem/sudmex_conn/python/.env/lib/python3.10/site-packages/numpy/lib/histograms.py:906: RuntimeWarning: invalid value encountered in divide
  return n/db/n.sum(), bin_edges
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types               Categories                                        │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓ ┏━━━━━━━━━━━━━━━━━━━━━━━┓                                │
│ ┃ dataframe          Values ┃ ┃ Column Type  Count ┃ ┃ Categorical Variables ┃                                │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩ ┡━━━━━━━━━━━━━━━━━━━━━━━┩                                │
│ │ Number of rows    │ 139    │ │ float64     │ 24    │ │ prof_mental           │                                │
│ │ Number of columns │ 28     │ │ int64       │ 3     │ └───────────────────────┘                                │
│ └───────────────────┴────────┘ │ category    │ 1     │                                                          │
│                                └─────────────┴───────┘                                                          │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓  │
│ ┃ column_name             NA     NA %    mean    sd      p0      p25     p75     p100      hist     ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩  │
│ │ rid                       0     0    75    46     1    36   110     160 █▇▇▆▇▅  │  │
│ │ group                     0     0   1.5   0.5     1     1     2       2 ▇    █  │  │
│ │ demo                      0     0  0.81   0.4     0     1     1       1 ▂    █  │  │
│ │ sex                       2   1.4   1.1  0.35     1     1     1       2 █    ▁  │  │
│ │ age                       2   1.4    31   7.7    18    24    37      50 █▇▆█▃▁  │  │
│ │ educ                      6   4.3   3.2   1.4     1     2     5       7 ▂█▅▃▆   │  │
│ │ occup                     8   5.8   3.6     2     0     3     4       9 ▂ █ ▂   │  │
│ │ income                   33    24  6300  6800     0  2800  8000   50000   █▂    │  │
│ │ civil_st                 30    22   3.8   2.1     0     2     6       6  ▃▅ ▃█  │  │
│ │ child                    32    23  0.53   0.5     0     0     1       1 ▇    █  │  │
│ │ child_num                35    25  0.96   1.1     0     0     2       4 █▄ ▃▂   │  │
│ │ bro_num                  40    29   2.7   1.7     0     2     3      10 ▃█▂ ▁   │  │
│ │ place_bro                36    26   2.1   1.6     0     1     3       9 █▆▄ ▁   │  │
│ │ years_mental             43    31     2     4     0     0     2      20  █▁▁    │  │
│ │ hosp_subst               47    34  0.74   1.6     0     0     1       7   █▁    │  │
│ │ support_ever             40    29   0.3  0.48     0     0     1       2  █  ▃   │  │
│ │ support_years            41    29  0.45   1.4     0     0     0      10 │  │
│ │ work_thirtydays          58    42   2.3   1.5     0     1     3       7 █▅▄▂▁▁  │  │
│ │ work_threeyears          56    40   2.5   1.5     0     1     3       7 █▆▆▃▁▁  │  │
│ │ energy_freq              40    29  0.24  0.55     0     0     0       2 █  ▁ ▁  │  │
│ │ energy_recent            40    29  0.16  0.37     0     0     0       1 █    ▂  │  │
│ │ energy_cans              41    29  0.24  0.64     0     0     0       3  █ ▁    │  │
│ │ laterality                2   1.4   1.2  0.52     1     1     1       3 █  ▁ ▁  │  │
│ │ ed_score                  4   2.9    86    39  -100    88   100     100 │  │
│ │ amai                     11   7.9   4.1   1.8     1     2     6       7 ▁▇▅▃▅█  │  │
│ │ amai_score               11   7.9   120    52    16    79   170     240 ▂█▅▆▆▂  │  │
│ │ notes                   140   100   nan   nan   nan   nan   nan     nan         │  │
│ └────────────────────────┴───────┴────────┴────────┴────────┴────────┴────────┴────────┴──────────┴──────────┘  │
│                                                    category                                                     │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name                       NA         NA %            ordered                unique             ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩  │
│ │ prof_mental                            42            30False                                21 │  │
│ └──────────────────────────────────┴───────────┴────────────────┴───────────────────────┴────────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯

Sweetviz

This tool generates an HTML report that can be embedded in the notebook through widgets or an iframe. Its interface is somewhat more polished and is interactive. It also includes an associations section for reviewing the relationships between variables.

Code
import sweetviz as sv
import warnings
warnings.filterwarnings('ignore')

result = sv.analyze(demo_data)
result.show_notebook()

Notes

Several packages offer similar functionality that goes beyond `info()` or `describe()`. The tools shown here focus on giving a quick view of variable distributions, missing values, and even relationships between variables. One tool I considered including is pandas-profiling. I may add it in an update, but for now it raises an error when loading pandas libraries.
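
For comparison, the missing-value part of those summaries can be reproduced with plain pandas. The following is only a sketch on an invented three-column dataframe, and the helper name `missing_summary` is hypothetical.

```python
import pandas as pd


def missing_summary(frame):
    """Return the count and percentage of missing values per column."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Show the most incomplete columns first.
    return summary.sort_values("missing", ascending=False)


toy = pd.DataFrame({
    "age": [41.0, None, 27.0, 23.0],
    "income": [20000.0, None, None, 6000.0],
    "group": [1, 2, 2, 1],
})
print(missing_summary(toy))
```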

About Quarto

In this proof of concept, rendering a web page from this notebook was straightforward. Using the same parameters as in R in the initial chunk, the result is hard to tell apart from the R output. What I find notable is that inserting a single chunk at the top is enough to generate a PDF, a Word document, a presentation, or a web page. There are other ways to export a Jupyter notebook, but having one tool that works in both languages and keeps a consistent style simplifies the work.
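
As an illustration, a Quarto front-matter block of the kind described might look like the following. The option values are assumptions chosen to resemble this page (title, author, date, and folded code cells); they are not a copy of the original header.

```yaml
---
title: "Exploratory analysis of SUDMEX CONN"
author: "Edgar"
date: "2022-10-08"
format:
  html:
    code-fold: true
jupyter: python3
---
```

Replacing `html` with `pdf`, `docx`, or `revealjs` under `format` targets the other output types mentioned here.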

NOTE: The DataPrep section would normally precede these notes, but the way DataPrep's output is rendered breaks the styling of any section that follows it, so it appears last. This could be solved by rendering DataPrep's report sections separately, but that was outside the scope of this proof of concept.

DataPrep

Like Sweetviz, this package generates a detailed report of the variables in the dataframe. As a bonus, it includes some preprocessing methods that can help speed up the analysis.

Code
from dataprep.eda import create_report


create_report(demo_data).show()
DataPrep Report

Overview

Dataset Statistics

Number of Variables 28
Number of Rows 139
Missing Cells 839
Missing Cells (%) 21.6%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 31.3 KB
Average Row Size in Memory 230.5 B
Variable Types
  • Numerical: 11
  • Categorical: 17

Dataset Insights

rid is uniformly distributed Uniform
bro_num and work_threeyears have similar distributions Similar Distribution
sex has 2 (1.44%) missing values Missing
age has 2 (1.44%) missing values Missing
educ has 6 (4.32%) missing values Missing
occup has 8 (5.76%) missing values Missing
income has 33 (23.74%) missing values Missing
civil_st has 30 (21.58%) missing values Missing
child has 32 (23.02%) missing values Missing
child_num has 35 (25.18%) missing values Missing
bro_num has 40 (28.78%) missing values Missing
place_bro has 36 (25.9%) missing values Missing
prof_mental has 42 (30.22%) missing values Missing
years_mental has 43 (30.94%) missing values Missing
hosp_subst has 47 (33.81%) missing values Missing
support_ever has 40 (28.78%) missing values Missing
support_years has 41 (29.5%) missing values Missing
work_thirtydays has 58 (41.73%) missing values Missing
work_threeyears has 56 (40.29%) missing values Missing
energy_freq has 40 (28.78%) missing values Missing
energy_recent has 40 (28.78%) missing values Missing
energy_cans has 41 (29.5%) missing values Missing
laterality has 2 (1.44%) missing values Missing
ed_score has 4 (2.88%) missing values Missing
amai has 11 (7.91%) missing values Missing
amai_score has 11 (7.91%) missing values Missing
notes has 139 (100.0%) missing values Missing
occup is skewed Skewed
income is skewed Skewed
bro_num is skewed Skewed
place_bro is skewed Skewed
years_mental is skewed Skewed
support_years is skewed Skewed
work_threeyears is skewed Skewed
ed_score is skewed Skewed
group has constant length 1 Constant Length
demo has constant length 1 Constant Length
sex has constant length 3 Constant Length
educ has constant length 3 Constant Length
civil_st has constant length 3 Constant Length
child has constant length 3 Constant Length
child_num has constant length 3 Constant Length
support_ever has constant length 3 Constant Length
work_thirtydays has constant length 3 Constant Length
energy_freq has constant length 3 Constant Length
energy_recent has constant length 3 Constant Length
energy_cans has constant length 3 Constant Length
laterality has constant length 3 Constant Length
amai has constant length 3 Constant Length
notes has all distinct values Unique
ed_score has 5 (3.6%) negatives Negatives
occup has 8 (5.76%) zeros Zeros
income has 11 (7.91%) zeros Zeros
years_mental has 51 (36.69%) zeros Zeros
support_years has 76 (54.68%) zeros Zeros

Variables


rid

numerical

Approximate Distinct Count 139
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 2224
Mean 75.2734
Minimum 1
Maximum 160
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • rid is uniformly distributed
  • rid is skewed right (γ1 = 0.1019)

Quantile Statistics

Minimum 1
5-th Percentile 7.9
Q1 35.5
Median 73
Q3 114.5
95-th Percentile 146.1
Maximum 160
Range 159
IQR 79

Descriptive Statistics

Mean 75.2734
Standard Deviation 45.5645
Variance 2076.1276
Sum 10463
Skewness 0.1019
Kurtosis -1.2022
Coefficient of Variation 0.6053

group

categorical

Approximate Distinct Count 2
Approximate Unique (%) 1.4%
Missing 0
Missing (%) 0.0%
Memory Size 9174

Length

Mean 1
Standard Deviation 0
Median 1
Minimum 1
Maximum 1

Sample

1st row 1
2nd row 2
3rd row 2
4th row 1
5th row 1

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 139
  • The top 2 categories (2, 1) take over 50.0%
  • group has words of constant length

demo

categorical

Approximate Distinct Count 2
Approximate Unique (%) 1.4%
Missing 0
Missing (%) 0.0%
Memory Size 9174
  • The largest value (1) is over 4.15 times larger than the second largest value (0)

Length

Mean 1
Standard Deviation 0
Median 1
Minimum 1
Maximum 1

Sample

1st row 1
2nd row 1
3rd row 1
4th row 1
5th row 1

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 139
  • The top 2 categories (1, 0) take over 50.0%
  • The largest value (1) is over 4.15 times larger than the second largest value (0)
  • demo has words of constant length

sex

categorical

Approximate Distinct Count 2
Approximate Unique (%) 1.5%
Missing 2
Missing (%) 1.4%
Memory Size 9316
  • The largest value (1.0) is over 5.85 times larger than the second largest value (2.0)

Length

Mean 3
Standard Deviation 0
Median 3
Minimum 3
Maximum 3

Sample

1st row 1.0
2nd row 1.0
3rd row 1.0
4th row 1.0
5th row 1.0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 274
  • The top 2 categories (1.0, 2.0) take over 50.0%
  • The largest value (10) is over 5.85 times larger than the second largest value (20)
  • sex has words of constant length

age

numerical

Approximate Distinct Count 31
Approximate Unique (%) 22.6%
Missing 2
Missing (%) 1.4%
Infinite 0
Infinite (%) 0.0%
Memory Size 2192
Mean 30.7518
Minimum 18
Maximum 50
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • age is skewed right (γ1 = 0.2998)

Quantile Statistics

Minimum 18
5-th Percentile 19.8
Q1 24
Median 30
Q3 37
95-th Percentile 44
Maximum 50
Range 32
IQR 13

Descriptive Statistics

Mean 30.7518
Standard Deviation 7.7039
Variance 59.3497
Sum 4213
Skewness 0.2998
Kurtosis -0.8158
Coefficient of Variation 0.2505

educ

categorical

Approximate Distinct Count 6
Approximate Unique (%) 4.5%
Missing 6
Missing (%) 4.3%
Memory Size 9044

Length

Mean 3
Standard Deviation 0
Median 3
Minimum 3
Maximum 3

Sample

1st row 7.0
2nd row 5.0
3rd row 5.0
4th row 5.0
5th row 5.0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 266
  • The top 2 categories (2.0, 5.0) take over 50.0%
  • educ has words of constant length

occup

numerical

Approximate Distinct Count 10
Approximate Unique (%) 7.6%
Missing 8
Missing (%) 5.8%
Infinite 0
Infinite (%) 0.0%
Memory Size 2096
Mean 3.6107
Minimum 0
Maximum 9
Zeros 8
Zeros (%) 5.8%
Negatives 0
Negatives (%) 0.0%
  • occup is skewed right (γ1 = 0.4671)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 3
Median 3
Q3 4
95-th Percentile 7
Maximum 9
Range 9
IQR 1

Descriptive Statistics

Mean 3.6107
Standard Deviation 2.0175
Variance 4.0703
Sum 473
Skewness 0.4671
Kurtosis 0.1152
Coefficient of Variation 0.5588
  • occup is not normally distributed (p-value 1.3499206683604491e-18)
  • occup has 48 outliers

income

numerical

Approximate Distinct Count 39
Approximate Unique (%) 36.8%
Missing 33
Missing (%) 23.7%
Infinite 0
Infinite (%) 0.0%
Memory Size 1696
Mean 6309.6226
Minimum 0
Maximum 50000
Zeros 11
Zeros (%) 7.9%
Negatives 0
Negatives (%) 0.0%
  • income is skewed right (γ1 = 3.739)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 2800
Median 4900
Q3 8000
95-th Percentile 16000
Maximum 50000
Range 50000
IQR 5200

Descriptive Statistics

Mean 6309.6226
Standard Deviation 6834.7511
Variance 4.6714e+07
Sum 668820
Skewness 3.739
Kurtosis 19.2041
Coefficient of Variation 1.0832
  • income is not normally distributed (p-value 6.339218029230788e-09)
  • income has 7 outliers

civil_st

categorical

Approximate Distinct Count 6
Approximate Unique (%) 5.5%
Missing 30
Missing (%) 21.6%
Memory Size 7412
  • The largest value (6.0) is over 1.63 times larger than the second largest value (2.0)

Length

Mean 3
Standard Deviation 0
Median 3
Minimum 3
Maximum 3

Sample

1st row 1.0
2nd row 6.0
3rd row 6.0
4th row 6.0
5th row 6.0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 218
  • The top 2 categories (6.0, 2.0) take over 50.0%
  • The largest value (60) is over 1.63 times larger than the second largest value (20)
  • civil_st has words of constant length

child

categorical

Approximate Distinct Count 2
Approximate Unique (%) 1.9%
Missing 32
Missing (%) 23.0%
Memory Size 7276

Length

Mean 3
Standard Deviation 0
Median 3
Minimum 3
Maximum 3

Sample

1st row 0.0
2nd row 0.0
3rd row 0.0
4th row 0.0
5th row 0.0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 214
  • The top 2 categories (1.0, 0.0) take over 50.0%
  • child has words of constant length

child_num

categorical

Approximate Distinct Count 5
Approximate Unique (%) 4.8%
Missing 35
Missing (%) 25.2%
Memory Size 7072
  • The largest value (0.0) is over 2.27 times larger than the second largest value (1.0)

Length

Mean 3
Standard Deviation 0
Median 3
Minimum 3
Maximum 3

Sample

1st row 0.0
2nd row 0.0
3rd row 0.0
4th row 0.0
5th row 0.0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 208
  • The top 2 categories (0.0, 1.0) take over 50.0%
  • The largest value (00) is over 2.27 times larger than the second largest value (10)
  • child_num has words of constant length

bro_num

numerical

Approximate Distinct Count 10
Approximate Unique (%) 10.1%
Missing 40
Missing (%) 28.8%
Infinite 0
Infinite (%) 0.0%
Memory Size 1584
Mean 2.6869
Minimum 0
Maximum 10
Zeros 5
Zeros (%) 3.6%
Negatives 0
Negatives (%) 0.0%
  • bro_num is skewed right (γ1 = 1.5009)

Quantile Statistics

Minimum 0
5-th Percentile 0.9
Q1 2
Median 2
Q3 3
95-th Percentile 7
Maximum 10
Range 10
IQR 1

Descriptive Statistics

Mean 2.6869
Standard Deviation 1.7359
Variance 3.0132
Sum 266
Skewness 1.5009
Kurtosis 3.4136
Coefficient of Variation 0.6461
  • bro_num is not normally distributed (p-value 6.6190821548576405e-15)
  • bro_num has 14 outliers

place_bro

numerical

Approximate Distinct Count 9
Approximate Unique (%) 8.7%
Missing 36
Missing (%) 25.9%
Infinite 0
Infinite (%) 0.0%
Memory Size 1648
Mean 2.068
Minimum 0
Maximum 9
Zeros 5
Zeros (%) 3.6%
Negatives 0
Negatives (%) 0.0%
  • place_bro is skewed right (γ1 = 1.9142)

Quantile Statistics

Minimum 0
5-th Percentile 1
Q1 1
Median 2
Q3 3
95-th Percentile 4.9
Maximum 9
Range 9
IQR 2

Descriptive Statistics

Mean 2.068
Standard Deviation 1.5672
Variance 2.4561
Sum 213
Skewness 1.9142
Kurtosis 4.6291
Coefficient of Variation 0.7578
  • place_bro is not normally distributed (p-value 3.1701947108018384e-17)
  • place_bro has 4 outliers

prof_mental

categorical

Approximate Distinct Count 20
Approximate Unique (%) 20.6%
Missing 42
Missing (%) 30.2%
Memory Size 2624
  • The largest value (0) is over 2.79 times larger than the second largest value (4)

Length

Mean 1.567
Standard Deviation 1.3987
Median 1
Minimum 1
Maximum 11

Sample

1st row 0
2nd row 0
3rd row 3
4th row 0
5th row 0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 126
  • The largest value (0) is over 2.79 times larger than the second largest value (4)

years_mental

numerical

Approximate Distinct Count 17
Approximate Unique (%) 17.7%
Missing 43
Missing (%) 30.9%
Infinite 0
Infinite (%) 0.0%
Memory Size 1536
Mean 2.0031
Minimum 0
Maximum 20
Zeros 51
Zeros (%) 36.7%
Negatives 0
Negatives (%) 0.0%
  • years_mental is skewed right (γ1 = 2.7174)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0
Median 0
Q3 2
95-th Percentile 10
Maximum 20
Range 20
IQR 2

Descriptive Statistics

Mean 2.0031
Standard Deviation 4.0109
Variance 16.087
Sum 192.3
Skewness 2.7174
Kurtosis 7.3727
Coefficient of Variation 2.0023
  • years_mental is not normally distributed (p-value 6.298570799989492e-23)
  • years_mental has 14 outliers

hosp_subst

categorical

Approximate Distinct Count 8
Approximate Unique (%) 8.7%
Missing 47
Missing (%) 33.8%
Memory Size 6257
  • The largest value (0.0) is over 7.33 times larger than the second largest value (1.0)

Length

Mean 3.0109
Standard Deviation 0.1043
Median 3
Minimum 3
Maximum 4

Sample

1st row 0.0
2nd row 0.0
3rd row 0.0
4th row 0.0
5th row 0.0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 185
  • The top 2 categories (0.0, 1.0) take over 50.0%
  • The largest value (00) is over 7.33 times larger than the second largest value (10)

support_ever

categorical

Approximate Distinct Count 3
Approximate Unique (%) 3.0%
Missing 40
Missing (%) 28.8%
Memory Size 6732
  • The largest value (0.0) is over 2.5 times larger than the second largest value (1.0)

Length

Mean 3
Standard Deviation 0
Median 3
Minimum 3
Maximum 3

Sample

1st row 0.0
2nd row 0.0
3rd row 0.0
4th row 0.0
5th row 0.0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 198
  • The top 2 categories (0.0, 1.0) take over 50.0%
  • The largest value (00) is over 2.5 times larger than the second largest value (10)
  • support_ever has words of constant length

support_years

numerical

Approximate Distinct Count 11
Approximate Unique (%) 11.2%
Missing 41
Missing (%) 29.5%
Infinite 0
Infinite (%) 0.0%
Memory Size 1568
Mean 0.4537
Minimum 0
Maximum 10
Zeros 76
Zeros (%) 54.7%
Negatives 0
Negatives (%) 0.0%
  • support_years is skewed right (γ1 = 4.5604)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0
Median 0
Q3 0
95-th Percentile 3.15
Maximum 10
Range 10
IQR 0

Descriptive Statistics

Mean 0.4537
Standard Deviation 1.4104
Variance 1.9893
Sum 44.46
Skewness 4.5604
Kurtosis 23.4284
Coefficient of Variation 3.1089
  • support_years is not normally distributed (p-value 1.1899573044062926e-24)
  • support_years has 22 outliers

work_thirtydays

categorical

Approximate Distinct Count 8
Approximate Unique (%) 9.9%
Missing 58
Missing (%) 41.7%
Memory Size 5508

Length

Mean 3
Standard Deviation 0
Median 3
Minimum 3
Maximum 3

Sample

1st row 1.0
2nd row 1.0
3rd row 4.0
4th row 2.0
5th row 2.0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 162
  • work_thirtydays has words of constant length

work_threeyears

numerical

Approximate Distinct Count 9
Approximate Unique (%) 10.8%
Missing 56
Missing (%) 40.3%
Infinite 0
Infinite (%) 0.0%
Memory Size 1328
Mean 2.5229
Minimum 0
Maximum 7
Zeros 1
Zeros (%) 0.7%
Negatives 0
Negatives (%) 0.0%
  • work_threeyears is skewed right (γ1 = 0.9157)

Quantile Statistics

Minimum 0
5-th Percentile 1
Q1 1
Median 2
Q3 3
95-th Percentile 5
Maximum 7
Range 7
IQR 2

Descriptive Statistics

Mean 2.5229
Standard Deviation 1.4999
Variance 2.2496
Sum 209.4
Skewness 0.9157
Kurtosis 0.5892
Coefficient of Variation 0.5945
  • work_threeyears is not normally distributed (p-value 5.2985257588298194e-14)
  • work_threeyears has 2 outliers

energy_freq

categorical

Approximate Distinct Count 3
Approximate Unique (%) 3.0%
Missing 40
Missing (%) 28.8%
Memory Size 6732
  • The largest value (0.0) is over 6.75 times larger than the second largest value (1.0)

Length

Mean 3
Standard Deviation 0
Median 3
Minimum 3
Maximum 3

Sample

1st row 0.0
2nd row 0.0
3rd row 0.0
4th row 0.0
5th row 0.0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 198
  • The top 2 categories (0.0, 1.0) take over 50.0%
  • The largest value (00) is over 6.75 times larger than the second largest value (10)
  • energy_freq has words of constant length

energy_recent

categorical

Approximate Distinct Count 2
Approximate Unique (%) 2.0%
Missing 40
Missing (%) 28.8%
Memory Size 6732
  • The largest value (0.0) is over 5.19 times larger than the second largest value (1.0)

Length

Mean 3
Standard Deviation 0
Median 3
Minimum 3
Maximum 3

Sample

1st row 0.0
2nd row 0.0
3rd row 0.0
4th row 0.0
5th row 0.0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 198
  • The top 2 categories (0.0, 1.0) take over 50.0%
  • The largest value (00) is over 5.19 times larger than the second largest value (10)
  • energy_recent has words of constant length

energy_cans

categorical

Approximate Distinct Count 4
Approximate Unique (%) 4.1%
Missing 41
Missing (%) 29.5%
Memory Size 6664
  • The largest value (0.0) is over 10.38 times larger than the second largest value (1.0)

Length

Mean 3
Standard Deviation 0
Median 3
Minimum 3
Maximum 3

Sample

1st row 0.0
2nd row 0.0
3rd row 0.0
4th row 0.0
5th row 0.0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 196
  • The top 2 categories (0.0, 1.0) take over 50.0%
  • The largest value (00) is over 10.38 times larger than the second largest value (10)
  • energy_cans has words of constant length

laterality

categorical

Approximate Distinct Count 3
Approximate Unique (%) 2.2%
Missing 2
Missing (%) 1.4%
Memory Size 9316
  • The largest value (1.0) is over 11.9 times larger than the second largest value (2.0)

Length

Mean 3
Standard Deviation 0
Median 3
Minimum 3
Maximum 3

Sample

1st row 1.0
2nd row 1.0
3rd row 2.0
4th row 1.0
5th row 1.0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 274
  • The top 2 categories (1.0, 2.0) take over 50.0%
  • The largest value (10) is over 11.9 times larger than the second largest value (20)
  • laterality has words of constant length

ed_score

numerical

Approximate Distinct Count 9
Approximate Unique (%) 6.7%
Missing 4
Missing (%) 2.9%
Infinite 0
Infinite (%) 0.0%
Memory Size 2160
Mean 86.2037
Minimum -100
Maximum 100
Zeros 2
Zeros (%) 1.4%
Negatives 5
Negatives (%) 3.6%
  • ed_score is skewed left (γ1 = -3.7682)

Quantile Statistics

Minimum -100
5-th Percentile 17.5
Q1 87.5
Median 100
Q3 100
95-th Percentile 100
Maximum 100
Range 200
IQR 12.5

Descriptive Statistics

Mean 86.2037
Standard Deviation 38.6416
Variance 1493.1765
Sum 11637.5
Skewness -3.7682
Kurtosis 13.8255
Coefficient of Variation 0.4483
  • ed_score is not normally distributed (p-value 4.930280310580592e-24)
  • ed_score has 12 outliers

amai

categorical

Approximate Distinct Count 7
Approximate Unique (%) 5.5%
Missing 11
Missing (%) 7.9%
Memory Size 8704

Length

Mean 3
Standard Deviation 0
Median 3
Minimum 3
Maximum 3

Sample

1st row 7.0
2nd row 3.0
3rd row 7.0
4th row 4.0
5th row 3.0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 256
  • amai has words of constant length

amai_score

numerical

Approximate Distinct Count 75
Approximate Unique (%) 58.6%
Missing 11
Missing (%) 7.9%
Infinite 0
Infinite (%) 0.0%
Memory Size 2048
Mean 119.6406
Minimum 16
Maximum 241
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • amai_score is skewed right (γ1 = 0.2224)

Quantile Statistics

Minimum 16
5-th Percentile 42
Q1 79
Median 113.5
Q3 166
95-th Percentile 203.3
Maximum 241
Range 225
IQR 87

Descriptive Statistics

Mean 119.6406
Standard Deviation 52.0505
Variance 2709.2557
Sum 15314
Skewness 0.2224
Kurtosis -0.8798
Coefficient of Variation 0.4351

notes

categorical

Approximate Distinct Count 1
Approximate Unique (%) 0.7%
Missing 0
Missing (%) 0.0%
Memory Size 9452

Length

Mean 3
Standard Deviation 0
Median 3
Minimum 3
Maximum 3

Sample

1st row nan
2nd row nan
3rd row nan
4th row nan
5th row nan

Letter

Count 417
Lowercase Letter 417
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 0
  • notes has words of constant length

Interactions

Correlations

Missing Values